generating labeled training data
Fabricator: An Open Source Toolkit for Generating Labeled Training Data with Teacher LLMs
Jonas Golde, Patrick Haller, Felix Hamborg, Julian Risch, Alan Akbik
Most NLP tasks are modeled as supervised learning and thus require labeled training data to train effective models. However, manually producing such data at sufficient quality and quantity is known to be costly and time-intensive. Current research addresses this bottleneck by exploring a novel paradigm called zero-shot learning via dataset generation. Here, a powerful LLM is prompted with a task description to generate labeled data that can be used to train a downstream NLP model. For instance, an LLM might be prompted to "generate 500 movie reviews with positive overall sentiment, and another 500 with negative sentiment." The generated data could then be used to train a binary sentiment classifier, effectively leveraging an LLM as a teacher to a smaller student model. With this demo, we introduce Fabricator, an open-source Python toolkit for dataset generation. Fabricator implements common dataset generation workflows, supports a wide range of downstream NLP tasks (such as text classification, question answering, and entity recognition), and is integrated with well-known libraries to facilitate quick experimentation. With Fabricator, we aim to support researchers in conducting reproducible dataset generation experiments using LLMs and help practitioners apply this approach to train models for downstream tasks.
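The teacher-student workflow in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions and does not use Fabricator's actual API: `build_prompt`, `generate_dataset`, and `stub_teacher` are hypothetical names, and the stub returns canned text where a real setup would call an LLM.

```python
from typing import Callable, List, Tuple


def build_prompt(task_description: str, label: str) -> str:
    """Combine the task description with the label the teacher should produce."""
    return f"{task_description}\nLabel: {label}\nText:"


def generate_dataset(
    teacher: Callable[[str], str],
    task_description: str,
    labels: List[str],
    num_per_label: int,
) -> List[Tuple[str, str]]:
    """Ask the teacher for `num_per_label` texts per label and pair each text with its label."""
    dataset = []
    for label in labels:
        prompt = build_prompt(task_description, label)
        for _ in range(num_per_label):
            text = teacher(prompt)
            dataset.append((text.strip(), label))
    return dataset


def stub_teacher(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; returns canned text so the sketch runs offline."""
    if "Label: positive" in prompt:
        return " A moving, beautifully shot film. "
    return " Dull plot and wooden acting. "


dataset = generate_dataset(
    teacher=stub_teacher,
    task_description="Write a short movie review with the given overall sentiment.",
    labels=["positive", "negative"],
    num_per_label=3,
)

for text, label in dataset:
    print(f"{label}\t{text}")
```

The resulting `(text, label)` pairs are what a smaller student classifier would be trained on. Swapping `stub_teacher` for a function that queries a real LLM leaves the rest of the loop unchanged.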
[Podcast] Generating Labeled Training Data for Your ML/AI Models
If you're not already a listener of the "This Week in Machine Learning & AI" podcast, today is an opportune day to become one. Mighty AI's own Principal Data Scientist, Angie Hugeback, was show host and creator Sam Charrington's most recent guest, and the episode turned out pretty dang great if we do say so ourselves. Sam and Angie chatted about cool data science and machine learning projects Angie has led and been a part of throughout her career, the many challenges and considerations involved in generating high-quality AI training data, and more. Give it a listen on iTunes, SoundCloud, Google Play, or Stitcher. A big thank you goes out to Sam at This Week in Machine Learning & AI for the opportunity.